;;; -*- Mode: TEXT -*-
;;; File: AutoClass:doc;preparation.text
;;;--------------------------------------------------------------------------;;;
;;; AUTOCLASS 3.0 Released 5/11/90 contact: Taylor@pluto.arc.nasa.gov ;;;
;;; by P. Cheeseman, J. Stutz, R. Hanson, W. Taylor ;;;
;;; NASA Ames Research Center, MS 244-17, Moffett Field, CA 94035 ;;;
;;; ;;;
;;; Copyright (C) 1990 Research Institute for Advanced Computer Science. ;;;
;;; All rights reserved. The RIACS Software Policy contains specific ;;;
;;; terms and conditions on the use of this software, and must be ;;;
;;; distributed with any copies. THIS FILE MAY BE REDISTRIBUTED. This ;;;
;;; copyright and notice must be preserved in all copies made of this file. ;;;
;;;--------------------------------------------------------------------------;;;
PREPARING DATA FOR AUTOCLASS
1.0 Introduction
1.1 Applicable Types of Data
1.2 Probability Models
1.3 Input Files
1.3.1 Data File
1.3.2 Header File
1.3.2.1 Header File Example
1.3.3 Model File
1.3.3.1 Model File Example
1.4 Checking Input Files
1.0 Introduction
This documentation file is directed at anyone who will be preparing data
bases for AutoClass 3.0. It requires no statistics or Artificial Intelligence
background, just basic knowledge of the Lisp language.
1.1 Applicable Types of Data
AutoClass is applicable to observations of things that can be described by
a set of features or properties, without referring to other things. This
allows us to represent the observations by a data vector corresponding to a
fixed attribute set. Attributes are names of measurable or distinguishable
properties of the things observed. The data values corresponding to each
attribute are thus limited to be either numbers or the elements of a fixed set
of attribute specific symbols. With numeric data, a fixed measurement error
is assumed and must be provided with the attribute description. AutoClass
cannot express relationships between things because such relationships are not
a property of the thing itself. Nor can AutoClass deal with properties
expressed as sets of values. However the current models do allow for missing
or unknown values. The program itself imposes no specific limit on the number
of data, but databases having more than 10^4 attribute values may require
excessive search time.
Note that there are techniques for re-expressing some data types in forms
acceptable to AutoClass. If a set valued property is limited to subsets of a
set of symbols, one can re-express the property as a set of binary attributes,
one for each of the possible symbols. Temporal ordering data can be expressed
as "time of (year, week, day)" or "time elapsed since ...". And one can
always indicate that a relation has been observed, even if the related thing
cannot be named. A simple example of the latter is the transformation of
`married-to' to `married?'.
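The set-to-binary re-expression described above can be sketched in a few
lines of Lisp. This is an illustration only, not part of AutoClass; the
function name SET-TO-BINARY-ATTRIBUTES and the example symbols are
hypothetical.

```lisp
;; Hypothetical helper, not part of AutoClass: re-express a set-valued
;; property as one binary attribute per possible symbol.
(defun set-to-binary-attributes (observed-set all-symbols)
  "Return a list of 1/0 flags, one per symbol in ALL-SYMBOLS, in order."
  (mapcar #'(lambda (symbol)
              (if (member symbol observed-set) 1 0))
          all-symbols))

;; Example: a property limited to subsets of (red green blue).
(set-to-binary-attributes '(blue red) '(red green blue))
;; => (1 0 1)
```

Each resulting flag can then be declared as a two-valued discrete attribute.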
1.2 Probability Models
The current models assume that attributes are conditionally independent
given the class. Thus within each class the probability that an instance of
the class will have a particular value of any attribute depends only on the
class and is independent of all other attribute values. The probability that
the class would produce any particular instance is then the product of the
individual attribute probability terms. At present it is not possible to
model relations between instances that are not conditional on the class alone.
This is a limit of the current likelihood model set and will be corrected in a
future release.
We use a multinomial model term for discrete attributes of nominal,
ordered, and circular subtypes (all are currently handled identically). This
model term allows any number of values, including missing. We use a Gaussian
normal model term for real-valued numerical attributes, that is, any attribute
representing a measurement. There are two versions, one of which allows for the
possibility of missing values. There is also an `ignore' model term for
attributes which are not to be considered in generating the classification.
The set of currently available model terms is the value of *model-term-types*,
generated as the model files are loaded.
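The conditional independence assumption can be illustrated with a small
sketch: the probability of an instance given a class is the product of one
term per attribute. The functions below are hypothetical and much simplified
(no missing values, no measurement error); they are not the AutoClass model
term implementations, and the class parameters are made up.

```lisp
;; Illustrative only; these are not the AutoClass model-term implementations.
(defun multinomial-term (value probabilities)
  "Probability of discrete VALUE, looked up in an alist of (value . prob)."
  (cdr (assoc value probabilities)))

(defun normal-term (x mean sigma)
  "Gaussian density of real value X for a class with MEAN and SIGMA."
  (/ (exp (- (/ (expt (- x mean) 2) (* 2 sigma sigma))))
     (* sigma (sqrt (* 2 pi)))))

(defun instance-likelihood (term-values)
  "Product of the per-attribute terms, per the independence assumption."
  (reduce #'* term-values))

;; One instance with a discrete attribute value RED and a real value 1.0,
;; for a class whose parameters are invented for this example.
(instance-likelihood
 (list (multinomial-term 'red '((red . 0.5) (blue . 0.5)))
       (normal-term 1.0 1.0 2.0)))
```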
1.3 Input Files
An AutoClass data base resides in two files. There is a header file
(default type "hd2" from *header-file-type*) that describes the specific data
format and attribute definitions. The actual data values are in a data file
(default type "db2" from *data-file-type*). We use two files to allow editing
of data descriptions without having to deal with the entire data set. This
makes it easy to experiment with different descriptions of the database
without having to reproduce the data set. Internally, an AutoClass database
structure is identified by its header and data files, and the number of data
loaded. The set of currently loaded data bases may be found at *db-list*.
A classification of a data base is made with respect to a model which
specifies the form of the probability distribution function for classes in
that data base. Normally the model structure is defined in a model file
(default type "model" from *model-file-type*), containing one or more models.
Internally, a model is always defined relative to a particular database. Thus
it is identified by the corresponding database, the model's model file and
its sequential position in the file. A specific model may be used by any
number of simultaneous classifications of the data base. A model file may be
used with any number of data bases to produce specific models for those
databases. See *model-list* for the currently loaded models.
1.3.1 Data File
The format of the data file is that of data objects (datum) terminated by
the end of the file. The number of values for each data object must be equal
to the number of attributes defined in the header file. There is an implied
delimiter character after each data object. Note that data objects may be
vectors, lists, or groups of tokens delimited by end-of-line characters.
Missing attribute values in the data file may be represented by either 'nil,
?, or other symbols specified in the header file. The internal
representation of a missing value is 'nil for all data types. Individual data
values may be numbers (both integer and floating point), strings, or symbols.
Any lisp readable object may be used as the value of an attribute which is
typed as 'dummy in the header and is ignored by the models.
Example: (data-syntax :vector)
#(7.8674307 33.311752 0.6008e03 10 1 1)
#(5.3936334 30.08755 0.6634e03 4 2 1)
#(6.838643 39.646942 0.6115e03 2 1 1)
#(5.4278746 26.337687 nil 0 ? 1)
Example: (data-syntax :list)
("Dry Rot" 35.18797 0.5388601 3 1 1)
("Wet Rot" 26.803675 0.53074133 5 1 1)
(nil 27.456902 0.5660058 ? 2 1)
("All Rot" 38.981537 0.62709737 7 1 1)
Example: (data-syntax :line)
white 38.991306 0.54248405 2 2 1
red 25.254923 0.5010235 9 2 1
yellow 32.407973 nil 8 2 1
all-white 28.953982 0.5267696 0 1 1
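As a rough illustration of the :line syntax, the sketch below reads one line
of tokens into a list, maps unknown tokens to nil, and checks the value count
against the number of attributes. PARSE-LINE-DATUM is a hypothetical function
written for this example; it is not the AutoClass reader, and it assumes the
tokens are separated by spaces.

```lisp
;; Hypothetical sketch; AutoClass's own reader is not shown in this file.
(defun parse-line-datum (line number-of-attributes
                         &optional (unknown-tokens '(? nil)))
  "Read the space-separated tokens of LINE into a list, mapping unknown
tokens to NIL, and signal an error if the value count is wrong."
  (let ((datum (with-input-from-string (in line)
                 (loop for token = (read in nil :eof)
                       until (eq token :eof)
                       collect (if (member token unknown-tokens)
                                   nil
                                   token)))))
    (unless (= (length datum) number-of-attributes)
      (error "Expected ~D values, found ~D: ~S"
             number-of-attributes (length datum) line))
    datum))

;; A made-up line in the style of the :line example above.
(parse-line-datum "yellow 32.5 ? 8 2 1" 6)
;; => (YELLOW 32.5 NIL 8 2 1)
```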
1.3.2 Header File
The header file specifies the data file format, the definitions of the data
attributes, and optional discrete attribute value translations. The value
translations for discrete type attributes provide a level of data abstraction.
For example, if the data values for an attribute are 1, 2, 3, ..., 9 and their
meanings are "New York", "Chicago", "Los Angeles", ..., then a translator can
be defined such that the influence values report or the cross-reference by
class report (discussed in file reports.text) will use the string names,
instead of the integers. The header file contains function calls to
DEFINE-DATA-FILE-FORMAT and DEFINE-ATTRIBUTE-DEFINITIONS, and optionally to
DEFINE-DISCRETE-TRANSLATIONS.
Note that if you are working in Symbolics Lisp or TI Explorer Lisp, then the
file's "mode line" package argument should be AUTOCLASS (*ac-pkg*) to get
function argument definitions, and to allow the file to be loaded rather than
read. The header file functional specification follows:
(DEFINE-DATA-FILE-FORMAT *** REQUIRED ***
(&key number-of-attributes
(separator-chars '(# )) ;; add other characters, as needed
(comment-chars '(# )) ;; add other characters, as needed
(unknown-tokens '(? nil)) ;; add other symbols, as needed
(data-syntax :line) ;; one of :vector, :list, or :line
(data-base *input-data-base*)));; used only when called directly
Note: :separator-chars, :comment-chars, & :unknown-tokens, when specified,
do not need to include the default characters.
(DEFINE-ATTRIBUTE-DEFINITIONS *** REQUIRED ***
'<Attribute Descriptors List>)
Attribute Descriptors declare how to interpret attribute values.
A descriptor applies to an attribute (index), or a list of attribute
indices or to the symbol 'default.
Duplication of an attribute index will cause a break.
Omitted attributes will either receive the specified default or be declared
to 'dummy. A warning message will be generated by the AutoClass file
reading functions for any unspecified attributes which are set to 'dummy.
Each descriptor is a list of:
Attribute index (zero based), or list of indices, or 'default.
Attribute type. Must be a property indicator in the list *att-type-data*
Attribute sub-type. Must be an indicator in the property value of type.
Attribute description string.
List of type and subtype specific property type and value pairs. See
*att-type-data* for the available properties. Others will be added.
Currently available combinations:
  type       sub-type    property type(s)
  ----       --------    ----------------
  dummy      none        --
  real       location    error
  real       scalar      zero-point rel-error
  real       scalar      zero-point error
  discrete   nominal     range
  discrete   ordered     range ordering
  discrete   circular    range ordering
An example is given in 1.3.2.1. Note that the last three combinations
will be handled identically, until appropriate specializations of the
multinomial model have been developed. The value of *Att-type-data*
gives the relations that are currently in effect. The commented out
portions of the definition indicate possible future directions in
attribute type representations.
(DEFINE-DISCRETE-TRANSLATIONS *** OPTIONAL ***
'<Discrete Attribute Translations List>)
This applies only to 'discrete type attributes. If not supplied, the
translations will be constructed for you from the data. However, the data
abstraction feature will then not be available.
<Discrete Attribute Translations List>:
Each translation is a list of:
Discrete Attribute index (zero based), or list of indices, or 'default.
Zero or more pairs of input-form output-form translators.
Alternately the translations pair list may be given as the 'translations
property of the attribute definition descriptor. Such translations will take
precedence.
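The effect of a translation pair list can be pictured as a simple lookup from
input form to output form. The function below is a hypothetical sketch for
illustration, not the AutoClass implementation; returning the untranslated
value when no pair matches is an assumption.

```lisp
;; Hypothetical lookup, for illustration; not the AutoClass implementation.
(defun translate-value (value pairs)
  "Return the output form paired with VALUE in PAIRS, or VALUE if none."
  (let ((pair (assoc value pairs :test #'equal)))
    (if pair (second pair) value)))

;; Pairs in the (input-form output-form) shape described above.
(translate-value 2 '((1 "New York") (2 "Chicago") (3 "Los Angeles")))
;; => "Chicago"
```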
1.3.2.1 Header File Example
(define-data-file-format :number-of-attributes 25
:separator-chars '(# )
:comment-chars '(#||)
:unknown-tokens '(unk)
:data-syntax :vector)
(define-attribute-definitions
'((0 dummy nil "True class, range = 1 - 3" (range 3))
(1 real location "X location, m. in range of 25.0 - 40.0" (error .25))
(2 real location "Y location, m. in range of 0.5 - 0.7" (error .05))
(3 real scalar "Weight, kg. in range of 5.0 - 10.0"
(zero-point 0.0 rel-error .001))
(4 discrete nominal "Truth value, range = 1 - 2" (range 2))
(5 discrete nominal "Color of foobar, 10 values" (range 10))
(6 discrete ordered "Spectral color group"
(range 6 ordering (r o y g b v)))
(7 discrete circular "Points of Compass"
(range 8 ordering (N NE E SE S SW W NW)))
((8 9 10 11 12 13 14 15 16 17 18 19 20) real scalar
"spectral intensity" (zero-point 0.0 rel-error .001))
(default discrete nominal "logical noise" (range 2))))
(define-discrete-translations
'((5 (n brown) (b buff) (c cinnamon) (g gray) (r green) (p pink)
(u purple) (e red) (w white) (y yellow))
(6 (r red) (o orange) (y yellow) (g green) (b blue) (v violet))
(4 (1 false) (2 true))
(7 (N North) (NE Northeast) (E East) (SE Southeast) (S South)
(SW Southwest) (W West) (NW Northwest))
(default (0 false) (1 true))))
1.3.3 Model File
The model file contains data describing the model(s) that will be used for
the classification. This file is read, not loaded. Each model is specified
by a list of model group lists. Each model group list associates some
attributes with a model term type.
Each model group list consists of:
An interaction term type (one of *model-term-types*).
Zero or more attribute set lists of attribute indices, or the symbol
'default.
Notes:
At least one model description list is required.
There may be multiple entries in a model for any model term type.
An attribute index alone is equivalent to a singleton attribute set.
An attribute index must not appear more than once in a model list.
Ignore is not a valid 'default model term type.
*Model-Term-Types* currently looks like this:
(single-multinomial single-normal-cn single-normal-cm ignore)
See the corresponding "model-<model-term-type>.lisp" file for detailed
model descriptions.
Single-Multinomial models discrete attributes as multinomials.
Single-Normal-cn models real valued attributes as normals.
Single-Normal-cm models real valued attributes with missing values.
Ignore allows the model to ignore an attribute.
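The rule that an attribute index must not appear more than once in a model
list can be checked before the file is given to AutoClass. The functions
below are a hypothetical sketch, not part of AutoClass; they assume the model
list layout described above, with entries that are either single indices,
lists of indices, or the symbol 'default.

```lisp
;; Hypothetical checker; the model list layout follows the description above.
(defun model-attribute-indices (model)
  "Collect every attribute index named in MODEL, flattening attribute sets
and skipping the symbol DEFAULT."
  (loop for group in model
        append (loop for entry in (rest group)
                     append (cond ((integerp entry) (list entry))
                                  ((listp entry)
                                   (remove-if-not #'integerp entry))
                                  (t nil)))))

(defun duplicated-indices (model)
  "Return the attribute indices that appear more than once in MODEL."
  (let ((indices (model-attribute-indices model)))
    (remove-duplicates
     (remove-if #'(lambda (i) (= 1 (count i indices))) indices))))

;; Index 3 is listed under two model terms, which is not allowed.
(duplicated-indices '((ignore 0)
                      (single-normal-cn 1 2 3)
                      (single-multinomial 3 4)))
;; => (3)
```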
1.3.3.1 Model File Example
A model list suitable for the above header file follows. Note that since
all of the current model terms take single attributes, single indices have
been substituted for the attribute set lists needed for multiple terms:
((ignore 0)
(single-normal-cn 1 2 3)
(single-multinomial 4 5 6 7)
(single-normal-cm 'default)
(single-multinomial 21 22 23 24))
The following illustrates how multiple attribute terms will be handled:
((ignore 0)
(single-normal-cn 1)
(multi-normal-cn (2 3))
(single-multinomial 4 7)
(joint-multinomial (5 6) (21 22 23 24))
(sparse-multi-normal-cm (8 9 10 11 12 13 14 15 16 17 18 19 20)))
1.4 Checking Input Files
A function named AUTOCLASS-INPUT-CHECK is provided to check the validity
of a set of data, header, and model files without initiating a classification
search. Thus errors and warnings can be dealt with prior to beginning the
search, hopefully contributing to a more useful classification search. A
history of the error and warning messages is saved, by default, in a log file.
The input argument key-word list of this function is:
data-file (header-file "") (model-file "") (log-file-p t)
output-files-default (reread t) (regenerate t) n-data
It reads and returns the data base and model(s) defined by :data-file,
:header-file, and :model-file. The :data-file value must be a fully qualified
pathname. The :header-file value can be a fully qualified pathname, a file
name (its root will default to that of :data-file), or not provided at all
(its pathname will default to that of :data-file). The :model-file behaves in
the same manner as :header-file. File name extensions (file types) for
:data-file, :header-file, and :model-file are forced to canonical values by
the AutoClass program:
:data-file "db2" (defined by *data-file-type*)
:header-file "hd2" (defined by *header-file-type*)
:model-file "model" (defined by *model-file-type*)
If :log-file-p is t and :output-files-default is nil, the log file will be
named by default "<data-file-name>&<header-file-name>&<model-file-name>", and
will have the same root as :data-file. Specifying :output-files-default as a
file name (e.g. "my-log-file") will override the default name. Specifying it
as a path name (e.g. "my-record-dir/my-log-file") will override the pathname
of :data-file, as well. If :log-file-p is nil, then no log file is generated.
The log file is created with keyword options: :if-exists :append and
:if-does-not-exist :create, so that multiple sessions of AUTOCLASS-INPUT-CHECK
will result in only one log file, as long as only the version numbers, and not
the file names, of :data-file, :header-file and :model-file, change. The file
extension of the log file is forced to "log" (defined by *log-file-type*).
Note that :output-files-default is also an argument to AUTOCLASS-SEARCH
and GENERATE-CLSF, so it can be used to give all your output files
("log", "search", and "dump"/"results") consistent names for a particular
classification run.
The switches :reread and :regenerate are defaulted to t to force complete
re-reading of the data file, and re-generation of the models. This
incorporates changes and corrections which you make to the data file, the
model file, and the header file. N-data, if supplied, allows the reading of
less than the full data file. This is useful when the data-file is very large
and you are just interested in validating the header and model file contents.
However, once this is accomplished, invoke the function CLEAR-STORES to clear
out the stored short data base. This assures that subsequent processing using
either AUTOCLASS-INPUT-CHECK, AUTOCLASS-SEARCH, or GENERATE-CLSF will reload
the complete data base and regenerate the model.
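A typical checking session might look like the sketch below. The pathnames
are hypothetical, the :n-data value is arbitrary, and calling CLEAR-STORES
with no arguments is an assumption, since its argument list is not given in
this file.

```lisp
;; Sketch only: the pathnames are hypothetical, and calling CLEAR-STORES
;; with no arguments is an assumption.
(autoclass-input-check :data-file "/usr/me/data/rot.db2"
                       :header-file "rot"   ; root defaults to :data-file's
                       :model-file "rot"    ; behaves like :header-file
                       :n-data 100)         ; read less than the full file

;; Once the header and model files validate, discard the short data base
;; so that later calls reload the complete data base.
(clear-stores)
```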
All advisory, warning, and error messages are output to the screen, and to
the log file, provided that the :log-file-p argument is t (the default).
Advisory messages are output to provide information which is not crucial to
the continuance of the run. Warning messages contain information which may
affect the quality of the run. However, the default condition is to NOT stop
the run when one or more warning messages are generated. The Common Lisp
global variable *break-on-warning* controls this functionality: binding this
variable to t will cause AUTOCLASS-SEARCH or GENERATE-CLSF to "break" on
warning messages. The function AUTOCLASS-INPUT-CHECK does not generate a
classification or invoke a search, hence it is finished when it outputs any
warning messages. Error messages are fatal, and if generated during the
invocation of GENERATE-CLSF or AUTOCLASS-SEARCH, the run will be
terminated.